Tempo by mode: Is there a difference?
Hard to tell from the histograms alone.
Look at the mean tempo for each mode:
- Major: \(\approx 122\) bpm
- Minor: \(\approx 116\) bpm
Is this difference significant? What do we mean by significance anyway?
Structure of a hypothesis test
- Start with a null hypothesis: an assumption about how the data is generated
- Based on this assumption, how likely were we to collect data as extreme as what we have?
  - p-value: probability of collecting data at least as extreme as ours (if the null hypothesis is true)
- Is the p-value considered low or not?
  - Threshold should depend on the context
  - Typical thresholds: 0.1, 0.05, 0.01
- If the p-value is below the threshold, 2 possible conclusions:
  - A rare event just happened, or
  - Our assumption in Step 1 (the null hypothesis) was false
Hypothesis test: coin flipping example
I flipped a coin 20 times and it came up heads 16 times. Is my coin biased?
- Start with a null hypothesis: Probability of heads \(p = 0.5\)
- Based on this assumption, how likely were we to collect data as extreme as what we have?
  - “As extreme”: 16 or more heads, or 4 or fewer heads
  - Probability of collecting data as extreme as ours: 0.0118
- Is the p-value considered low or not?
- If the p-value is below the threshold, 2 possible conclusions:
  - A rare event just happened, or
  - Our assumption in Step 1 (the null hypothesis) was false
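The 0.0118 figure can be reproduced in a few lines of R. This is a minimal sketch; the variable names `n_flips`, `n_heads`, `p_upper`, `p_lower` and `p_value` are chosen here for illustration and are not from the slides.

```r
n_flips <- 20
n_heads <- 16

# Under the null hypothesis p = 0.5, add up the probability of every
# outcome that is as extreme as ours: 16 or more heads, or 4 or fewer heads
p_upper <- sum(dbinom(n_heads:n_flips, size = n_flips, prob = 0.5))
p_lower <- sum(dbinom(0:(n_flips - n_heads), size = n_flips, prob = 0.5))
p_value <- p_upper + p_lower
round(p_value, 4)

# The built-in exact binomial test reports the same two-sided p-value
binom.test(n_heads, n_flips, p = 0.5)$p.value
```

With a threshold of 0.05 this p-value counts as low; with a threshold of 0.01 it does not, which is why the threshold should be chosen before looking at the data.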
Tempo by mode: Is there a difference?
Two options:
- \(t\)-test
  - Null hypothesis: Mean tempo for songs in a minor key is the same as that for songs in a major key
  - Makes more assumptions about the data generation process (“parametric test”)
- Kolmogorov-Smirnov test
  - Null hypothesis: The distribution of tempo for songs in a minor key is the same as that for songs in a major key
  - Fewer assumptions about the data generation process (“non-parametric test”), but rejecting the null gives less information
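Both tests are available in base R as `t.test` and `ks.test`. The sketch below runs them on simulated data, since the course dataset is not reproduced here: the data frame `songs`, its columns, the group sizes and the standard deviation of 28 bpm are all assumptions made for illustration.

```r
set.seed(1)

# Hypothetical data: simulated tempos, NOT the dataset used in the slides
songs <- data.frame(
  mode  = rep(c("Major", "Minor"), times = c(300, 200)),
  tempo = c(rnorm(300, mean = 122, sd = 28),
            rnorm(200, mean = 116, sd = 28))
)

# t-test: null hypothesis is that the two mean tempos are equal
# (R uses Welch's unequal-variance version by default)
t_res <- t.test(tempo ~ mode, data = songs)

# Kolmogorov-Smirnov test: null hypothesis is that the two tempo
# distributions are the same
ks_res <- ks.test(songs$tempo[songs$mode == "Major"],
                  songs$tempo[songs$mode == "Minor"])

c(t_test = t_res$p.value, ks_test = ks_res$p.value)
```

Each call returns an object whose `p.value` element can be compared against the chosen threshold.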
What is a model?
- A model is a simplified and idealized way to understand a system.
- R4DS: “The goal of a model is to provide a simple low-dimensional summary of a dataset. Ideally, the model will capture true ‘signals’ (i.e. patterns generated by the phenomenon of interest), and ignore ‘noise’ (i.e. random variation that you’re not interested in).”
Two steps to modeling
Step 1: Identify a family of models which express a generic pattern between your variables of interest.
Possible model family: Linear model, i.e. \(child = a_1 + a_2 \times parent\).
- Variables: child and parent
- Model parameters: \(a_1\) and \(a_2\)
Many other possible models: linear without intercept, quadratic, exponential, …
Different models within the linear model family
Each line corresponds to a choice of \(a_1\) and \(a_2\).
Two steps to modeling
Step 2: Find the model in this family that most closely matches your data.
That is, find specific values of \(a_1\) and \(a_2\) which make the model match the data most closely.
What do we mean by “closely matching the data”?
We choose \(a_1\) and \(a_2\) such that some objective function (loss function) is minimized.
Most common objective: Minimize the sum of squared residuals (the squared lengths of the black lines in the figure).
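To make the objective concrete, the sketch below writes the sum of squares as an R function and checks that the line found by `lm` gives a smaller value than nearby lines. It assumes the residuals are the vertical distances between each point and the line, which is what `lm` minimizes. The `parent` and `child` vectors are simulated stand-ins, and `sum_sq` is a helper name introduced here.

```r
set.seed(1)

# Hypothetical data standing in for the parent/child measurements
parent <- rnorm(100, mean = 68, sd = 2)
child  <- 24 + 0.65 * parent + rnorm(100, sd = 2)

# Loss function: sum of squared vertical distances between the data
# and the line child = a1 + a2 * parent
sum_sq <- function(a1, a2) sum((child - (a1 + a2 * parent))^2)

# lm() finds the a1 and a2 that minimize this loss
fit <- lm(child ~ parent)
a   <- unname(coef(fit))

sum_sq(a[1], a[2])          # loss at the least-squares line
sum_sq(a[1], a[2] + 0.01)   # a slightly steeper line has a larger loss
sum_sq(a[1] + 0.5, a[2])    # so does a line with a shifted intercept
```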
Linear models in R
- Linear regression can be done with the lm function
- Syntax: lm(formula, data = df)
- Formulas look like y ~ x, which lm will translate to a function like \(y = a_1 + a_2 \cdot x\)
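As a quick illustration of the syntax, the sketch below fits a linear model to `cars`, a small dataset that ships with R (stopping distance `dist` against `speed`). This dataset is used here only for convenience and is not the one from the slides.

```r
# Fit dist = a_1 + a_2 * speed using the built-in cars data frame
fit <- lm(dist ~ speed, data = cars)

# a_1 is reported as "(Intercept)", a_2 under the name of the x variable
coef(fit)

# Standard errors, p-values and R-squared for the fitted model
summary(fit)
```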
Models with categorical variables
Consider modeling valence ~ mode.
- Does the model \(valence = a_1 + a_2 \cdot mode\) make sense?
- 3 + 4 \(\cdot\) “Major”??
- What R does:
- Choose a baseline category (say, “Minor”)
- Model \(valence = a_1 + a_2 \cdot modeMajor\), where \(modeMajor = \begin{cases} 1 &\text{if "Major"}, \\ 0 &\text{if "Minor"}. \end{cases}\)
- \(valence = a_1\) if Minor, \(valence = a_1 + a_2\) if Major
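The 0/1 coding can be inspected directly with `model.matrix`. The sketch below uses simulated valence values (the `songs` data frame here is hypothetical) and sets the factor levels so that “Minor” is the baseline, matching the slide. It also checks that \(a_1\) and \(a_1 + a_2\) are simply the two group means.

```r
set.seed(1)

# Hypothetical data: simulated valence values, NOT the course dataset.
# Listing "Minor" first in `levels` makes it the baseline category.
songs <- data.frame(
  mode = factor(rep(c("Minor", "Major"), each = 50),
                levels = c("Minor", "Major"))
)
songs$valence <- ifelse(songs$mode == "Major", 0.55, 0.45) +
  rnorm(100, sd = 0.1)

# The column modeMajor is 1 for "Major" rows and 0 for "Minor" rows
head(model.matrix(~ mode, data = songs), 3)

fit <- lm(valence ~ mode, data = songs)
a   <- coef(fit)   # named "(Intercept)" and "modeMajor"

# a_1 is the Minor mean; a_1 + a_2 is the Major mean
group_means <- tapply(songs$valence, songs$mode, mean)
c(minor = unname(a[1]), major = unname(a[1] + a[2]))
group_means
```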
Additive models
Formula valence ~ loudness + mode translates to
- \(valence = a_1 + a_2 \cdot loudness + a_3 \cdot modeMajor\), where \(modeMajor = \begin{cases} 1 &\text{if "Major"}, \\ 0 &\text{if "Minor"}. \end{cases}\)
- \(valence = a_1 + a_2 \cdot loudness\) if Minor
- \(valence = (a_1 + a_3) + a_2 \cdot loudness\) if Major
- Same gradient, different intercept
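A minimal sketch of the additive model on simulated data follows. The `songs` data frame and the numbers used to generate it are assumptions for illustration only. The fitted coefficients come back in the order \(a_1, a_2, a_3\), from which the two parallel lines can be read off.

```r
set.seed(1)

# Hypothetical data: simulated loudness and valence, NOT the course dataset
songs <- data.frame(
  loudness = runif(200, min = -20, max = 0),
  mode     = factor(sample(c("Minor", "Major"), 200, replace = TRUE),
                    levels = c("Minor", "Major"))
)
songs$valence <- 0.7 + 0.02 * songs$loudness +
  0.05 * (songs$mode == "Major") + rnorm(200, sd = 0.1)

fit_add <- lm(valence ~ loudness + mode, data = songs)
a <- unname(coef(fit_add))   # a[1] = a_1, a[2] = a_2, a[3] = a_3

# One shared slope, two intercepts
slope      <- a[2]
intercepts <- c(Minor = a[1], Major = a[1] + a[3])
slope
intercepts
```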
Models with interaction
Formula valence ~ loudness * mode translates to
- \(valence = a_1 + a_2 \cdot loudness + a_3 \cdot modeMajor + \color{blue}{a_4 \cdot loudness \cdot modeMajor}\), where \(modeMajor = \begin{cases} 1 &\text{if "Major"}, \\ 0 &\text{if "Minor"}. \end{cases}\)
- \(valence = a_1 + a_2 \cdot loudness\) if Minor
- \(valence = (a_1 + a_3) + (a_2 + a_4) \cdot loudness\) if Major
- Different gradient, different intercept
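The interaction model can be sketched the same way. The data below are simulated and the `songs` data frame is hypothetical. Because both the intercept and the slope are free to differ by mode, the line recovered for “Major” matches what a separate regression on the Major songs alone would give.

```r
set.seed(1)

# Hypothetical data: simulated loudness and valence, NOT the course dataset
songs <- data.frame(
  loudness = runif(200, min = -20, max = 0),
  mode     = factor(sample(c("Minor", "Major"), 200, replace = TRUE),
                    levels = c("Minor", "Major"))
)
songs$valence <- 0.7 + 0.02 * songs$loudness +
  (0.05 + 0.01 * songs$loudness) * (songs$mode == "Major") +
  rnorm(200, sd = 0.1)

fit_int <- lm(valence ~ loudness * mode, data = songs)
a <- unname(coef(fit_int))   # a_1, a_2, a_3, a_4 in that order

# Each mode gets its own intercept and its own slope
minor_line <- c(intercept = a[1],        slope = a[2])
major_line <- c(intercept = a[1] + a[3], slope = a[2] + a[4])

# Fitting the Major songs on their own gives the same line
fit_major <- lm(valence ~ loudness, data = songs, subset = mode == "Major")
major_line
coef(fit_major)
```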
Summary of the course
- Variable types
- Basic objects in R (vectors, lists, data frames)
- Plotting data with ggplot2
- Transforming and joining data with dplyr and tidyr
- Importing and exporting data
- Working with factors using forcats
- R scripts and R markdown
- Making maps
- Basic statistical testing and modeling
Where do we go from here?
- Read R for Data Science from cover to cover!
- Go through the programming exercises and solutions
- Take short courses on DataCamp
- Writing your own functions and running simulations
- Interactive plots with plotly
- Advanced mapping with ggmap
- Predictive models/machine learning with caret
- Interactive web apps with shiny
- Text analysis with tidytext
- Recommended text: Text Mining with R by Julia Silge and David Robinson (available online for free at tidytextmining.com)
Other Stanford courses
- Programming: CS 106A
- Statistical methods: STATS 60, STATS 101
- Data challenge lab: ENGR 150
Thank you! :)